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Abstract. In lattice QCD computations a substantial amount of work is spent in solving dis- 
cretized versions of the Dirac equation. Conventional Krylov solvers show critical slowing down for 
large system sizes and physically interesting parameter regions. We present a domain decomposi- 
tion adaptive algebraic multigrid method used as a preconditioner to solve the "clover improved" 
Wilson discretization of the Dirac equation. This approach combines and improves two approaches, 
namely domain decomposition and adaptive algebraic multigrid, that have been used separately in 
lattice QCD before. We show in extensive numerical test conducted with a parallel production code 
implementation that considerable speed-up over conventional Krylov subspace methods, domain de- 
composition methods and other hierarchical approaches for realistic system sizes can be achieved. 
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1. Introduction. Lattice QCD simulations are among the world's most de- 
manding computational problems, and a significant part of today's supercomputer 
resources is spent in these simulations [H . Our concern in this paper is three-fold: 
We want to make the mathematical modeling related with QCD and lattice QCD 
more popular in the scientific computing community and therefore spend some effort 
on explaining fundamentals. On top of that we develop a new and efficient adaptive 
algebraic multigrid method to solve systems with the discretized Dirac operator, and 
we show results for a large number of numerical experiments based on an advanced, 
production code quality implementation with up-to-date physical data. 

The computational challenge in lattice QCD computations consists of repeatedly 
solving very large sparse linear systems 

Dz = h, (1.1) 

where D = D{U, m) is a discretization, typically the Wilson discretization, of the Dirac 
operator on a four-dimensional space-time lattice. The Wilson Dirac operator depends 
on a gauge field U and a mass constant m. In recent computations lattices with up 
to 144 X 64"^ lattice points have been used, involving the solution of linear systems 
with 452,984,832 unknowns [TJ HI [51 HOI [H] . Usually these linear systems are solved 
by standard Krylov subspace methods. Those suffer from critical slowing down when 
approaching the physically relevant parameter values (e.g., physical mass constant or 
lattice spacing a — >■ 0) . Thus it is of utmost importance to develop preconditioners for 
said methods which overcome these scaling issues. The most commonly used precondi- 
tioners in lattice QCD computations nowadays are odd-even preconditioning [T51 , 
deflation techniques [35* and domain decomposition approaches [331 El] ■ While these 
approaches yield significant speed-ups over the unpreconditioned versions, their scal- 
ing behavior is unchanged and critical slowing down still occurs. Comparing with 
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established theoretical results for domain decomposition approaches for elliptic PDEs 
this behavior is to be expected since there the condition number of the preconditioned 
system scales with H/h, where h is the lattice spacing and H the block size pi]. 

Motivated by the potential (e.g., for elliptic PDEs) of convergence independence 
of the discretization mesh-size, multigrid methods have been considered in the lattice 
QCD community as well. However, due to the random nature of the gauge fields 
involved, the treatment of the lattice Dirac equation by geometric multigrid methods, 
i.e., methods based solely on the underlying PDE, has been elusive for the last twenty 
years [7l [151 EH US] . With the advent of adaptive algebraic multigrid methods effec- 
tive preconditioners for QCD calculations could be constructed in recent years. The 
pioneering work from [31 1121 137| showed very promising results. There, an adaptive 
non-smoothed aggregation approach based on '13J has been proposed for the solution 
of the Wilson Dirac system of equations. 

Within the physics community, another hierarchical technique, the recently pro- 
posed domain decomposition type solver named inexact deflation developed in |35j is 
widely used. A well-optimized code for this solver is publicly available Inexact 
deflation can be regarded as an adaptive method as well. It performs a setup phase 
which allows the construction of a smaller system, the little Dirac operator, which is 
then used as part of an efficient preconditioner. Although there is an intimate con- 
nection with the aggregation based multigrid approach from inexact deflation 
seems to have been developed completely independently. As a consequence, the in- 
exact deflation method does not resemble a typical multigrid method in the way its 
ingredients are arranged. In particular, it requires the little Dirac system to be solved 
to high accuracy in each iteration. 

In this paper we present a multigrid method that combines aspects from |35] . 
namely a domain decomposition smoother, and from non-smoothed aggregation as in 
[31 [37] . Our approach elaborates on the multigrid methods from [3l [37] in that we 
use a domain decomposition method as the smoother instead of the previously used 
Krylov subspace smoother. This allows for a natural and efficient parallelization, 
also on hybrid architectures. Moreover, we substantially improve the adaptive setup 
from 131 [37] and [3S] in the sense that less time is required to compute the operator 
hierarchy needed for an efficient multigrid method. Our approach can also be regarded 
as turning the domain decomposition technique from |35] into a true multigrid method. 
The little Dirac system now needs to be solved only with low accuracy. This allows, 
in particular to apply the method recursively and thus allows for a true and more 
efficient multigrid instead of just a two-grid method, a feature which is impossible to 
do with the approach from . 

The paper is organized as follows. In Section[2]we give an introduction into lattice 
QCD for the non-specialist and we introduce the domain decomposition Schwarz 
method in this context. In Section [3] we first outline algebraic multigrid methods in 
general and then focus on aggregation based approaches. Thereby we address the 
peculiarities of lattice QCD and explain different possible adaptive strategies for the 
construction of the multigrid hierarchy. The inexact deflation method from ^35] is 
discussed in Section [4l where in particular we point out the differences to a multigrid 
method and describe the adaptive nature of its setup. In Section[5]we finally formulate 
our multigrid approach, for which we present thorough numerical tests in Section [6] 

2. Lattice Quantum Chromodynamics. Quantum Chromodynamics (QCD) 
or the theory of strong interactions, is a quantum field theory in four-dimensional 
space-time and part of the standard model of elementary particle physics. It has a 
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high predictive power, i.e., a small number of free parameters. Predictions that can 
be deduced from this theory are amongst others the masses of hadrons, composite 
particles bound by the strong interaction (e.g., nucleon, pion; cf. |19)). The masses 
of hadrons and many other predictions have to be obtained non-perturbatively, i.e., 
via numerical simulations requiring the discretization and numerical evaluation of the 
theory. After a brief description of continuum QCD we introduce its discretization 
on a hyper-cubic lattice and discuss the need of iterative methods, e.g., Krylov sub- 
space methods, for the solution of the arising linear systems of equations. Due to 
the ill-conditioned nature of these systems, preconditioning is advised and the use 
of a domain decomposition method is discussed as a prerequisite for our multigrid 
construction. 

2.1. Continuum QCD. The degrees of freedom of QCD are matter fields, called 
quarks, and gauge fields, called gluons. The governing system of partial differential 
equations of QCD are the Dirac equations 

{V + m)ip^r] (2.1) 

which define the dynamics of the quarks and the interaction of quarks and gluons 
for a given gluon field background. Here, ip = ip{x) and rj = ri{x) represent matter 
fields. These depend on x, the points in space-time, x = {xQ,Xi,X2,x^j^ The gluons 
are represented in the Dirac operator V to be discussed below, and m represents a 
scalar mass parameter not depending on x. This mass parameter sets the mass of the 
quarks in the QCD theory. 

In the continuum theory the Dirac operator V can be written as 

3 

where 9^ — d/dXf^ and Af^{x) is the gauge field. The anti-hermitian traceless matrices 
A^{x) are elements of su(3), the Lie algebra of the special unitary group SU(3). 



The quark fields ip and rj in (2.1) carry two indices that are suppressed, i.e., 
= tpoT- These indices label internal degrees of freedom of the quarks; one is called 
color (c — 1, 2, 3) and the other spin {a = 0, 1, 2, 3). At each point x in space-time, we 
can represent the spinor il'{x), i.e. the quark field ip at a, given point x, as a twelve 
component column vector 

ij{x) = (-0io(a;),V'2o(a;),V'3o(a;), -011(2;), •••,V'33(a;))'^. (2.2) 

In case operations act unambiguously on the color but differently on the spin degrees 
of freedom we use the notation ip^ to denote those components of the quark field 
belonging to the fixed spin index a. For a given point x, ^aix) is thus represented 
by a three component column vector ipaix) = {ipi(T{x),ip2(j{x),ip3a{x))'^ ■ The value 
of the gauge field at point x is in the matrix representation of su(3) and acts 
non-trivially on the color and trivially on the spin degrees of freedom, i.e, {A^'il!){x) = 
{h® A^,{x))ip{x). 

The 7-matrices 7oj7i,72:73 S C^^"* act non-trivially on the spin and trivially on 
the color degrees of freedom. They are hermitian and unitary matrices which generate 



^Physical space-time is a four dimensional Minkowski space. We present the theory in Euclidean 
space-time since this version can be treated numerically. The two versions are equivalent, cf. I36| . 
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the Clifford algebra CUiC), i.e., they satisfy 



7/i7i^ + 7i^7m = 2(5^1^/4 for ^m.v ^Q, 1, 2, 3. (2.3) 

Unlike the gauge fields A^, the 7-matrices do not depend on x. The multiplication of 
a 7-matrix with -0 is defined by {j^ip){x) = (7^ (8) /3)i/'(x). 

The covariant derivative + A^^ is a "minimal coupling extension" of the deriva- 
tive dfj^, ensuring that {{d^-\- A^)'ip){x) transforms in the same way as ^j{x) under local 
gauge transformations, i.e., a local change of the coordinate system in color space. 
As part of the covariant derivative the A^ 's can be seen as connecting different (but 
infinitesimally close) space-time points. The combination of covariant derivatives and 
the 7-matrices ensures that Vip^x) transforms under the space-time transformations 
of special relativity (Lorcntz-transformations) in the same way as ^{x). Local gauge 
invariance and special relativity are fundamental principles of the standard model of 
elementary particle physics. 

2.2. Lattice QCD. In order to compute predictions in QCD from first princi- 
ples and non-perturbatively, the theory of QCD has to be discretized and simulated 
on a computer. The discretization error is then accounted for by extrapolation to the 
continuum limit based on simulations at different lattice spacings. One of the most 
expensive tasks in these computations is the solution of the discretized Dirac equa- 
tion for a given right hand side. In this section we give a brief introduction into the 
principles of this discretization and discuss some properties of the arising linear op- 
erators. Since the discretization is typically formulated on an equispaced lattice, this 
treatment of QCD is also referred to as lattice QCD. For a more detailed introduction 
to QCD and lattice QCD we refer the interested reader to [T71 [Ml 

Consider a four-dimensional Torus T. On T we have a periodic Nt x lattice 
C gT with lattice spacing a and nc — Nf lattice points. In here Ng denotes the 
number of lattice points for each of the three space dimensions and Nt the number 
of lattice points in the time dimension. Hence, for any x,y G C there exists a, p G 
such that 

y = X + a ■ p, i.e., = x,, + a ■ p^ ior fi = 0, 1, 2, 3. 
For shift operations on the lattice, we define shift vectors /i e by 

{a ^ — V 
else. 

A quark field is defined at each point of the lattice, i.e., it is a function 

X I— >■ "0(2;) 

on the lattice C which maps a point a; € £ to a spinor ip{x). As in continuum QCD, 
this spinor again has color and spin indices i^ca, c = 1, 2, 3, ct = 0, 1, 2, 3. For future 
use we introduce the symbols C and S for the color and the spin space, i.e. 

C = {1,2,3}, 5 = {0,1, 2, 3}. 

The gauge fields A^{x) connecting infinitesimally close space-time points in con- 
timmm QCD have to be replaced by objects that connect points at finite distances. 
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Fig. 2.1. Naming conventions on the lattice. 



Fig. 2.2. T/ie clover term. 



To this purpose variables Ufj,{x) are introduced. U^{x) connects x and a; + //, so that 
we regard U^{x) as being associated with the link between x and a; + /i. The link 
between x + ji and a;, pointing in the opposite direction, is then given by [/^(a;)^^. The 
matrices (x) are an approximation to the path-ordered exponential of the integral 
of along the link. They satisfy 

C/^(x) G SU(3), in particular U^,{x)'~'^ = U^{x)" . 



Figure 2.1 illustrates the naming conventions on the lattice. Ufj_{x) is called a gauge 
link and the set of all gauge links {J7p(a;) : a; € £, /.t = 0, 1, 2, 3} is called configuration. 

The covariant derivative of the continuum theory can be discretized in many ways. 
Here we restrict ourselves to the widely used Wilson discretization (cf. 50 ), noting 
that the multigrid solver developed in this paper is in principle applicable to any 
discretization resulting in local couplings. 

We define forward covariant finite differences 



Uf_,{x)i)„{x + fl) - ijjaix) 



and backward covariant finite differences 

, , i^aix) -Ufi^ix- fi)ij„{x- ft) 
(A^t/i^) (a;) = . 

Since (A'^)^ = — A^, the centralized covariant finite differences (A'' + A^)/2 are 
anti-hermitian. The simplest discretization of the Dirac operator is then given by 



D 



N 



3 

E 

M=0 



7^ (g, (A^ + A^) /2. 



This so-called naive discretization suffers from unphysical eigenvectors to eigenvalues 
of small magnitude (also known as the "species doubling effect" or "red-black instabil- 
ity"). This is a standard phenomenon when discretizing first order derivatives using 
central finite differences, cf. [15.. Wilson introduced the stabilization term aA^A'^, a 
centralized second order covariant finite difference, to avoid this problem. The Wilson 
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discretization of the Dirac operator is thus given by 



Dw^'^I+ ^^(7M$5(A^ + A^)-a/4®A^A^), (2.4) 

fj.=0 

where the mass parameter mg is used to tune the mass of the quark to its physical 
value. 
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The commutativity relations (2.3) of the 7- matrices imply a non-trivial symmetry 
of Dw- Defining 75 = 70717273 we have 757^ = -7^75 for /i = 0, 1, 2, 3, and since 7^ 
and 75 are hermitian we see that 757^^ is anti-hermitian. Thus the operator (757^^) (8) 
(Ap+A'') is hermitian, being the product of two anti-hermitian operators. To describe 
the resulting T^- symmetry of the Wilson Dirac operator, we define — (8>75 <8> ^3 
and have (T^Dw)^ — ^^Dw- To reduce the order of the discretization error as a 
function of a, a so-called Sheikholeslami-Wohlert or "clover" term (cf. [42] and Figure 



2.2 ) is added to the lattice Wilson Dirac operator 



3 

where (Q^i.?A<j) (x) = Qp.i,{x)'ip^{x) with 

Q^.Ax) = U^ix)U^{x + fi)U^{x + i))"U^{x)"+ 

U^{x) C/^(.T - /i + £>)^ U^{x - fi)" Uf,{x - /!) + 

U^{x - fl)" U^{x - fi-v)" Uf,{x - ji-v) U^{x - {>)+ 

U^{x - D)" Uf,{x - v) U^{x -C' + fi) Uf,{x)". 

The clover term is diagonal on the lattice C. It removes C'(a)-discretization effects 
from the covariant finite difference discretization of the covariant derivative (for appro- 
priately tuned Csw)- The resulting discretized Dirac operator D thus has discretization 
effects of order O(a^). It is again Fs-symmetric, i.e., we have 

{T5D)" = Tr^D. (2.6) 

The Fs-symmetry induces a symmetry on the spectrum of D. For future use, we 
state this in the following lemma. 

Lemma 2.1. Every right eigenvector tpx to an eigenvalue X of D corresponds to 
a left eigenvector ip^ — T^^x to the eigenvalue X of D and vice versa. In particular, 
the spectrum of D is symmetric with respect to the real axis. 

Proof Due to = T^DT^ we have 

D^x = Mx ^ i^xD" = A< ^ {T5i^x)"D = A(F5^a)''- 

□ 

Summarizing, D G c^x" jg ^ sparse matrix which represents a nearest neighbor 
coupling on a periodic 4D lattice. The lattice has nc — NtN^ sites, each holding 



12 variables, so that n = 12n£. D has the symmetry property (2.6 1, depends on 
a mass parameter mo, the Sheikholeslami-Wohlert constant Csw, and a configuration 
{Ufj^{x) : X Cz C, fj, — 0,1, 2, 3}. For physically relevant mass parameters, the spectrum 
of D is contained in the right half plane, cf. Fig. |2.3| and Fig. |2.4| 

While the continuum Dirac operator is normal, the Wilson Dirac operator is not, 
but it approaches normality when discretization effects become smaller. For small 
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Fig. 2.3. Spectrum of a 4* Wilson Dirac Fig. 2.4. Spectrum of a 4** "Clover irn- 

operator with mo = and Csw = 0. proved" Wilson Dirac operator with mo = and 

Csw — 1 • 



lattice spacing, large lattice sizes and physically relevant mass parameters we can 
thus expect that the whole field of values J^{D) = {ip^ D^p : ^p^^p = 1} of D is in the 
right half plane. 

To explicitly formulate D in matrix terms we fix a default representation for the 
7-matrices as 



70 



resulting in 



,7i 



,72 



,73 
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-V 



Thus 75 acts as the identity on spins and 1 and as the negative identity on spins 2 
and 3. D is then given via 



a 32a 



^.h'—O 



1 ^ 

fj,=0 

1 ^ 

- ^ E ((-^4 + 7,.) ® - A)) - A) 



2.3. Domain Decomposition in Lattice QCD. For ease of notation we from 
now on drop the lattice spacing a, so that the lattice C is given as 

C = {X = {xo,Xi,X2,X3), 1 < Xq < Nt, 1 < Xi_,X2,Xs < Ng}. 

Let us also reserve the notation block decomposition for a tensor type decomposi- 
tion of jC into lattice-blocks. The precise definition is as follows. 

Definition 2.2. Assume that {7^, ... 7^} is a partitioning of {1,..., Nt} into 



£o blocks of consecutive time points, 

T° ^ {t,-i + I, . . . ,tj}, J = 1, . . . ,£o, ^ to < ti . . . < ti„ ^Nt, 

and similarly for the spatial dimensions with partitionings {T^ , ■ ■ ■ 7^^}, /i = 1, • . • , 3. 

A block decomposition of C is a partitioning of C into £ = ^o^i^2^3 lattice-blocks 
Li, where each lattice-block is of the form 

Accordingly we define a block decomposition of all 12nc variables inV = CxCxS into 
£ blocks Vi by grouping all spin and color components corresponding to the lattice-block 
(Li, %.e., 

Vi = CixCxS. (2.7) 



Since the systems arising in lattice QCD calculations tend to have hundreds of 
millions of unknowns they require the use of parallel computers. For this reason 
and due to the fact that, as a rule, naive domain decomposition is already used to 
parallelize the matrix vector product Dz which is needed for Krylov subspace methods, 
it is natural to also use a domain decomposition approach as a preconditioner. 

The method of choice here is a colored version of the multiplicative Schwarz 
method IHJ HI] . Since the discretized Dirac operator has only nearest-neighbor cou- 
plings, only two colors are needed. This red-black Schwarz method for the solution of 
Dz = 6 is given in Algorithm [l] for a block decomposition of the lattice and variable 



blocks Vi according to (12. 71). Figure 2.5 illustrates a 2D example. The corresponding 



trivial embeddings and block solvers are denoted by 

Zv, : V, ^ V and B, = Iv^I^^DIv^r^Iv^. 
In each iteration, for the updates in lines [6] and 10 we have to solve the local systems 



D,e^ = I^T with A = Iv^DIv,, (2.8) 
Di representing the restriction of D on the lattice-block Vi. 



Algorithm 1 Red-Black Schwarz 

1: input: 2, b, v 

2: output: z 

3: for fc = 1 to 1/ do 

4: r b — Dz 

5: for all red i do 

6: z ^ z + Bir 

7: end for 

8: r b— Dz 

9; for all black i do 
10; z z + Bir 
11; end for 
12; end for 
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With the shorthand Bcoior = J2iecoior ''^'^'^ 

K — Bhlack {I - D Bred ) + Bred 

we can summarize one iteration — 1) of Algorithm [T] as (cf. 

z^{I - KD)z + Kb. (2.9) 

Since the solution z* ~ D^^b satisfies z* = {I — KD)z* + Kb, one iteration of 
Algorithm [T] updates the error e — z — z* via 

{I -KD)e, 

with / — KD the error propagation operator, 

EsAP =I-KD = {I- Buack D){I - Bred D) . 

The red-black Schwarz method has been introduced to lattice QCD in [M] and 
been used ever since in several lattice QCD implementations as a preconditioner (cf. [2 
[23l [35]). In this context red-black Schwarz is also known as Schwarz Alternating 
Procedure (SAP). In what follows the application of v iterations of SAP to a vector 
b with initial guess 2; = is denoted by the linear operator 



SAP^-KY^il-DKfb. 

k=0 



This representation follows by repeated application of (2.9). Note that EgAP = 



I - MsApD with MsAP = M 



(1) 

SAP- 




Fig. 2.5. Block decomposed lattice (re- 
duced to 2D) with 2 colors 
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number of eigenvalue 

Fig. 2.6. Error component reduction on a 4* 
lattice with block size 2* 



Typically the solution of the local systems (2.8 1, required when computing Bir, is 



approximated by a few iterations of a Krylov subspace method (e.g., GMRES). When 
incorporating such an approximate solver, the SAP method becomes a non-stationary 
iterative process. Hence it is necessary to use flexible Krylov subspace methods like 
FGMRES or GCR in case that SAP is used as a preconditioner (cf. [2311341140] ). 

It turns out that SAP as a preconditioner is not able to remedy the unfavorable 
scaling behavior with respect to system size, quark mass and physical volume of Krylov 
subspace methods. When analyzing this behavior, one realizes that SAP reduces error 



components belonging to a large part of the spectrum very well, but a small part is 
almost not affected by SAP. We illustrate this in Figure |2.6| where the horizontal 
axis represents the eigenvectors u of D in ascending order of the absolute value of 
the corresponding eigenvalue. The vertical axis gives the ratio |ji?5Api'||/||i^|| , see 
Figure |2.6[ The ratio is small for larger eigenvalues and becomes significantly larger 
for the small eigenvalues. This behavior is typical for a smoother in an algebraic 
multigrid method which motivated us to use SAP in this context. 

3. Algebraic Multigrid Methods. Any multigrid method consists of two com- 
ponents, namely a smoother and a coarse grid correction [TSl 113 |M1 HT] . Typically, 
the smoother can be chosen as a simple iterative method. This can be a relaxation 
scheme like Jacobi or Gauss-Seidel or their block variants as well as Krylov subspace 
methods. Given the properties of SAP presented in the previous section we choose 
SAP as our smoothing scheme in the QCD context. 

Let us reserve the term near kernel for the space spanned by the eigenvectors 
belonging to small (in modulus) eigenvalues of D. Since SAP is not able to sufficiently 



remove error components belonging to the near kernel (cf. Figure 2.6), the multigrid 
method treats these persistent error components separately in a smaller subspace with 
variables. Thus, this subspace should approximate the near kernel. The typical 
algebraic multigrid setup is then as follows: We have to find an operator Dc which 
resembles D on that subspace both in the sense that it acts on the near kernel in a 
similar manner as D does, but also in terms of the connection structure and sparsity. 
The latter allows to work on Dc recursively using the same approach, thus going from 
two-grid to true multigrid. We also need a linear map i? : C" — ;> C"= to restrict 
information from the original space to the subspace and a linear map P : C"'= — >■ C" 
which transports information back to the original space. The coarse grid correction 
for a current iterate z on the original space is then obtained by restricting the residual 
r = b — Dz to the subspace, there solving 

DcCc = Rr (3.1) 

and transporting the coarse error Cc back to the original space as a correction for z, 
resulting in the subspace correction 

z^z + PD-^Rr, r = b-Dz (3.2) 

with the corresponding error propagator 

/ - PD-^RD. 

Typically, the coarse grid system is obtained as the Petrov-Galerkin projection 
with respect to P and R, i.e., 

Dc RDP. 

The coarse grid correction / — P{RDP)^^ RD then is a projection onto range(i?L))^ 
along range(P). The action of the coarse grid correction is thus complementary to that 
of the smoother if range(P) approximately contains the near kernel and ra.nge{RD)^ 
approximately contains the remaining complementary eigenvectors (which are then 
efhciently reduced by the smoother). The latter condition is satisfied if range(i?) 
approximately contains the left eigenvectors corresponding to the small eigenvalues. 
This can be seen by looking at exact eigenvectors: Since left and right eigenvectors 
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are mutually orthogonal, if range(i?) ~ range(i?I?) is spanned by left eigenvectors of 
Z?, then range(_R)^ is spanned by the complementary right eigenvectors of D. 

Once Dc is found a basic two-level algorithm consists of alternating the application 
of the smoother and the coarse grid correction. This procedure can be recursively 
extended to true multigrid by formulating a two-level algorithm of this kind for the 



solution of (3.1 1 until we obtain an operator which is small enough to solve (3.11 
directly. 



To be computationally beneficial, solving (3.1) has to be much cheaper than 
solving the original equation Dz = h. For this purpose has to be very small or 
sparse. As the number of eigenvectors that are not sufficiently reduced by the SAP 
smoother grows with n, cf. |35j . one should not aim at fixing rtc (like in deflation 
methods), but at finding sparse matrices i? and P whose ranges approximate the 
small left and right near kernel of D well, respectively. 

3.1. Aggregation-based Intergrid Transfer Operators. Consider a block 
decomposition of the lattice £ with lattice-blocks It has been observed in [55] 
that eigenvectors belonging to small eigenvalues of D tend to (approximately) coin- 
cide on a large number of lattice-blocks £i, a phenomenon which was termed "local 
coherence" in j35j . As a consequence, we can represent many eigenvectors with small 
eigenvalues from just a few by decomposing them into the parts belonging to each 
of the lattice-blocks. This is the philosophy behind the so-called aggregation-based 
intergrid transfer operators introduced in a general setting in [51 [T3] and applied to 
QCD problems in [3 [H [SZj . 

Definition 3.1. An aggregation {yli,...yls} is a partitioning of the set V = 
C X C X S of all variables. It is termed a lattice-block based aggregation if each Ai is 
of the form 



Ai H 



X m 



where Cj are the lattice-blocks of a block decomposition of C and Wi QC x S. 



Aggregates for the lattice Wilson Dirac operator (2.5 1 will typically be realized 



as lattice-block based aggregates. Note that, however, the SAP smoother on the one 
hand and interpolation and restriction on the other hand do not have to be based on 
a common block decomposition of C. 

Starting from a set of test vectors {vi, . . . ,WAr} which represent the near kernel 
and a set of aggregates {Ai, . . . ,As}^ the interpolation P is obtained by decomposing 
the test vectors over the aggregates 



/r 



(^1 



vn) 



VL 



p 



\ 



\ Ai 
) As 



(3.3) 



More formally, each aggregate Ai induces N variables (i — 1)A^ + 1, . . AN on the 
coarse system, and we define 



1, 



1, 



(3.4) 
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Herein, T^. represents the trivial restriction operator for the aggregate Ai, i.e., T^.Vj 
leaves the components of Vj from Ai unchanged while zeroing all others, and e(i_i)jv+i 
denotes the [i — 1)N + j-th unit vector. For the sake of stability, the test vectors are 
orthonormalized locally, i.e., for each i we replace i^.Vj in (3.4 1 by the j-th basis 
vector of an orthonormal basis of span(I^.tii, . . . ,I^.wjv). This does not alter the 
range of P nor does it change the coarse grid correction operator / — P{RDP)~'^RD, 
and it ensures P^ P = I. 

The restriction R is obtained in an analogous manner by using a set of test vectors 
{vi, . . . ,vn} and the same aggregates to build R^ . Figure 3.1 illustrates a lattice- 
block based aggregation from a lattice point of view where in each aggregate A we take 
yVi as the whole set C x S, again reduced to two dimensions. Then the aggregates can 
be viewed as forming a new, coarse lattice and the sparsity and connection structure 
of Dc — RDP resembles the one of D, i.e., we have again a nearest neighbor coupling. 
Each lattice point of the coarse grid, i.e., each aggregate, holds N variables. 




Fig. 3.1. Aggregation-based interpolation (geometrical point of view reduced to 2D) 



3.2. Petrov-Galerkin Approach in Lattice QCD. The structure and the 
spectral properties of the Wilson Dirac operator D suggest to explicitly tie the re- 
striction R to the interpolation P. The following construction of P — and thus R — is 
similar to constructions found in O [T51 [351 133 in the sense that the structure of 
all these interpolation operators is similar while the test vectors Vi upon which the 
interpolation is built — and therefore the action of the operators — are different. 

Due to Lemma 12.11 it is natural to choose 

R = (T^Pf 

in the aggregation based intergrid operators: if P is built from vectors vi,...,V]\[ 
which approximate right eigenvectors to small eigenvalues of D, then R = (Tc^P)^ is 
built from vectors Vi = Tr,Vi which approximate left eigenvectors to small eigenvalues. 

As was pointed out in [31, it is furthermore possible to even obtain R — P^ by 
taking the special spin-structure of the Dirac operator into account when defining the 
aggregates. To be specific, we introduce the following definition. 

Definition 3.2. The aggregation {Ai, i — I, . . . , s} is termed Fs-compatible if 
any given aggregate Ai is composed exclusively of fine variables with spin and 1 or 
of fine variables with spin 2 and 3. 
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Assume that we have a Fs-compatible aggregation and consider the interpolation 



operator P from (3.3). Since Fs acts as the identity on spins and 1 and as the 
negative identity on spins 2 and 3, when going from P to T^P each of the non-zero 
blocks in P belonging to a specific aggregate is either multiplied by +1 or by — 1. 
This gives 

T,P = PTl (3.5) 

with r§ acting as the identity on the spin-O-l-aggregates and as the negative identity 
on the spin-2-3-aggregates. 

Lemma 3.3. Let the aggregation be Tc^- compatible and P the corresponding 



aggregation based prolongation as in (3.3) and R — (T^P)^ . Consider the two coarse 
grid operators 

Df G ^ jijjp^ jj^ ^ p"dP. 

Then 

(ii) I - PD-^P"D = 1 - P{D^^)-^RD. 
(Hi) D^'^ is hermitian, is Vi^- symmetric, 
(iv) For the field of values we have J-{Dc) Q J-{D). 

Proof. We first observe that just as r5 the matrix is diagonal with diagonal 
entries +1 or — 1, so r§ = {^5)^ ~ (^5)^^- Part (i) now follows from 

D^^ ^ RDP = {T^P)"DP = {PTl)"DP = TlP" DP = TID^. 

Consequently, 

P{D^^)-^RD = PD-^TIP"T^D = PD-^TITIP" D = PD-^P"D, 
which gives (ii). For part (iii) we observe that 

(^PG)H ^ pHjjHj^H ^ pHjjHj.^p ^ pH^^j^p ^ j^jyp ^ j^PG 

This shows that -D^*^ is hermitian, which is equivalent to Dc = T^D^'-^ being r|- 
symmetric. Finally, since P^ P — I, we have 

HD,) = {V^f Z?cV'c : i^ci^c = 1} = {{P^c)"D{Pi^,) : (P^,)^(P^,) = 1} 

C {^p^Di; : i^^ip = 1} = F{D), 

which gives (iv). □ 

Lemma [3.3| has some remarkable consequences. Part (ii) shows that we end up 
with the same coarse grid correction, irrespectively of whether we pursue a Petrov- 
Galerkin approach (matrix D^'^ with R = F5P) or a Galerkin approach (matrix 
Dc, restriction is the adjoint of the prolongation). The Petrov-Galerkin matrix D^'^ 
inherits the hcrmiticity of the matrix T^D, whereas the Galerkin matrix Dc inherits 



the Fs-symmetry (and thus the symmetry of the spectrum, see Lemma 2.1) of D. 
Moreover, if T{D) lies in the right half plane, then so docs J^{Dc) and thus the 
spectrum of Dc- It is known that the "symmetrized" Wilson Dirac operator Tc^D is 
close to maximally indefinite |25| . i.e., the number of negative eigenvalues is about 
the same as the positive ones. This property is also inherited by D^'^ . 
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r5-symmetry implies an interesting connection between the eigensystem of T^D 
and the singular values and vectors of D. Indeed, if 



T^D^VAV", A diagonal, V^V ^ I 
denotes the eigendecomposition of the hermitian matrix T^D, then 

D = (r5ysign(A)) |A| v" = uj:v" (3.6) 

is the singular value decomposition of D with the unitary matrix U = r5l/sign(A) 
and S = |A|. 

The theory of algebraic multigrid methods for non-hermitian problems recently 
developed in [T3] suggests to base interpolation and restriction on the right and left 
singular vectors corresponding to small singular values rather than on eigenvectors, so 



we could in principle use the relation (3.6). However, obtaining good approximations 



for the singular vectors belonging to small singular values is now much harder than 
obtaining good approximations to eigenvectors belonging to small eigenvalues, since 
the small singular values lie right in the middle of the spectrum ofT^D, whereas the 
small eigenvalues of D lie at the "border" of its spectrum (and in the right half plane 
C"*" if J^(D) C C"*"). Numerically we did not find that going after the singular values 
payed off with respect to the solver performance and it significantly increased the 
setup timing. These observations led us to the eigenvector based adaptive multigrid 
approach presented here; it also motivates that we consider rather than D^*^ as 
the "correct" coarse grid system to work with recursively in a true multigrid method. 

In our computations, we take special Fs-compatible, lattice-block based aggrega- 
tions. 

Definition 3.4. Let Cj,j — 1, • . . , sl be a block decomposition of the lattice C. 
Then the standard aggregation {Aj,a,j = 1, • . ■ , s^, a = 0,1} is given by 

Aj^o X {0, 1} X C, Aj,i = Cj X {2, 3} x C. 



Aggregates of the standard aggregation always combine two spin degrees of free- 
dom in a Fs-compatible manner and all three color degrees of freedom. For any given 
j, the two aggregates Aj^ and Aj^i are the two only aggregates associated with the 
lattice-block £j. The standard aggregates thus induce a coarse lattice £c with nc^ 
sites where each coarse lattice site corresponds to one lattice-block £j and holds 27V 
variables with N the number of test vectors. N variables correspond to spin and 1 
(and aggregate Ajfi); another N variables to spin 2 and 3 (and aggregate Aj^i). Thus 
the overall system size of the coarse system is Uc = 2Nnc^ ■ 

With standard aggregation, in addition to the properties listed in Lemma |3.3| 
the coarse system Dc = DP also preserves the property that coarse lattice points 
can be arranged as a 4D periodic lattice such that the system represents a nearest 
neighbor coupling on this torus. Each coarse lattice point now carries 2A^ variables. 

We also note that applying R and P to a vector does not require any communi- 
cation in a parallel implementation if whole aggregates are assigned to one process. 

3.3. Adaptivity in Aggregation-based AMG. If no a priori information 
about the near kernel is available, the test vectors vi, . . . ,vn to be used in an aggre- 
gation based multigrid method have to be obtained computationally during a setup 
phase. We now briefly review the setup concept of adaptive (smoothed) aggregation 
as described in [T31. We do so in the Galerkin context, i.e., we take R = P". The first 
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fundamental idea of adaptivity in algebraic niultigrid methods is to use the smoother 
to find error components not effectively reduced by the smoother, i.e., belonging to 
the near kernel. Starting with an initial random vector u, some iterations with the 
smoothing scheme on the homogeneous equations Du = yield a vector v rich in 
components that are not effectively reduced. The first set of test vectors then is the 
singleton {v}, and one constructs the corresponding aggregation based interpolation 



P from (3.3). This construction guarantees that v is in range(P) and thus is treated 
on the coarse grid. Once a first two- or multigrid method is constructed in this way, 
one can use it to generate an additional vector not effectively reduced by the current 
method by again iterating on the homogeneous system. This newly found vector is 
added to the set of test vectors upon which we build new interpolation and coarse grid 
operators. Continuing in this manner we ultimately end up with a multigrid method 
which converges rapidly, but possibly at a high computational cost for the setup if 
many vectors need to be generated and incorporated in the interpolation operator. 
To remedy this issue, already in [T^, some sophisticated ideas to filter the best infor- 
mation out of the produced vectors, are proposed which have been partly used in the 
implementations of adaptive algebraic multigrid for QCD described in [3l [121 EZ] • 

The use of the homogeneous equation Du = as the anvil to shape useful infor- 
mation by working on it with the most advanced method at hand, is the core idea of 
early adaptivity in algebraic multigrid methods. 

3.4. Adaptivity in Bootstrap AMG. It is possible to use the current multi- 
grid method in an adaptive setup in more ways than just to test it for deficiencies 
by applying it to the homogeneous equation Du = 0. This is done in the bootstrap 
approach pursued in |10l lllj which we explain now. The following observation is cru- 
cial: Given an eigenpair (vc, Ac) of the generalized eigenvalue problem on the coarse 
grid 

we observe that {Pv^, Xc) solves the constrained eigenvalue problem 

find {v, A) with v G range(F) s.t. P" [Dv - A-y) = 

on the fine grid. This observation allows to use the coarse grid system as a source 
of information about the eigenvectors to small eigenvalues of the fine grid system. 
Computing eigenvectors to small Ac on the coarse grid is cheaper than on the fine 
grid, and applying a few iterations of the smoother to the lifted vectors Pvc yields 
useful test vectors rich in components belonging to the near kernel of the fine grid 
system. As we will see, the setup process used in the "inexact deflation" approach 
from |35| . explained in the next section, can also be interpreted as a bootstrap-type 
setup, where instead of using an exact solution to the coarse grid eigenproblem only 
approximations are calculated. 

4. Multigrid and Inexact Deflation. A hierarchical approach for solving the 



Wilson Dirac equation ( 1.1 ), which lately received attention in the Lattice QCD com- 
munity, was proposed in [35) . It is a combination of what is called "inexact deflation" 
with an SAP preconditioned generalized conjugate residuals (GCR) method. The pa- 
per [35' does not relate its approach to the existing multigrid literature. The purpose 
of this section is to recast the formulations from [35, into established terminology 
from algebraic multigrid theory and to explain the limitations of the overall method 
from 35 which composes its multigrid ingredients in a non-optimal manner. We also 
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explain how the setup employed in |35j to construct the "inexact deflation subspace" 
(i.e., the test vectors) can be viewed and used as an approximate bootstrap setup in 
the sense of Section |3^ 



4.1. Inexact Deflation. The inexact deflation subspace constructed in [35^ is 
the range of a linear operator P which resembles the definition of aggregation based 



interpolation from (3.3). As in the aggregation-based construction it uses a set of 
test vectors vi, . . . ,vn which are "chopped" up over aggregates (called subdomains 
in [33]) to obtain the locally supported columns of P. These aggregates are not Tc,- 
compatible, so the Fs-symmetry is not preserved on the coarse grid operator Dc which 
is obtained as Dc = P^ DP. Since the inexact deflation approach is not meant to be 
recursively extended to a true multilevel method, preserving important properties of 
the fine system on the coarse system is of lesser concern. However, within its two-level 
framework a (purely algebraic) defiating technique is applied when solving the coarse 
system. 

Two projections ttl, tth are defined in |35j as follows 

TTL=I - DPD-^P" and nR = I - PD-^P"D. (4.1) 

Clearly tt/j is the the coarse grid correction introduced in section [3j cf. Lemma [33| (i) . 
In the context of inexact deflation |35j uses these projections and the relation Dtth = 
TT^-D to decompose the linear system of equations Dz = b as 

DtTrZ = TTib, D{I - 'Kp;)z = (/ - TTi)?). 

The second equation can be simphfied to (/ - ttii)z = PD^^P^b. Thus the solution 
z can be computed as z = ttj^z + (/ — ttj^)z = x + x'j where 

X' = PD-'P"b 

only requires the solution of the coarse grid system D^. and 

DX = DtTrX = TTifo 

is the "inexactly deflated" system which in [35] is solved by a right preconditioned 
Krylov subspace method. To be specific, the Krylov subspace is built for the operator 

DttrM^^Jp 

and the right hand side n^b, and the Krylov subspace method is GCR (general con- 
jugate residuals, cf. [40] ) . a minimum residual approach which automatically adapts 
itself to the fact that the preconditioner Mg^Jp is not stationary, see the discussion in 
section 12.31 

4.2. Comparison of Multigrid and Inexact Deflation. Although the ingre- 
dients of an aggregation based algebraic multigrid method as described in section |3] 
and of "inexact deflation" as described in the previous paragraph are the same, their 
composition makes the difference. In the multigrid context we combine the SAP 
smoothing iteration with the coarse grid correction such that it gives rise to the error 
propagator of a V-cycle with v post smoothing steps 

E^{I- M^sApD){I - PD-^P"D) . 
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Hence we obtain for one iteration of the V-cycle 



where z denotes the current iterate and r the current residual b — Dz, and 

Cf"^) = a4% + PD-^P" - M^slpDPD-^P" . (4.2) 
In terms of the projectors (4.1 ) this can be written as 



Using the multigrid method as a right precondit loner in the context of a Krylov sub- 
space method, the preconditioner is given by C^^\ and the subspace is built for DC^^\ 
We again should use a flexible Krylov subspace method such as flexible GMRES or 
GCR, since the smoother Msap is non-stationary and, moreover, we will solve the 
coarse system Dc only with low accuracy using some "inner iteration" in every step. 
The important point is that a rough approximation of the coarse grid correction in 



(4.2), i.e., the solution of systems with the matrix D~'^ at only low accuracy, will 
typically have only a negligible effect on the quality of the preconditioner, and it will 
certainly not hamper the convergence of the iterates towards the solution of the sys- 
tem since multiplications with the matrix D are done exactly. On the other hand, in 
the "inexact deflation" context the exact splitting of the solution z = x' + X with 

x'^PD-^P"b, D7TRX = ^Lb 

requires the same final accuracy for both x' and x- Therefore, when computing x' i the 
coarse grid system has to be solved with high accuracy. More importantly, also 
appears in tth which is part of the "deflated" matrix Dnji in the system for x- In the 
inexact deflation context, this system is solved using SAP as a preconditioner. While 
we can allow for a flexible and possibly inexact evaluation of the preconditioner, the 
accuracy with which we evaluate the non-preconditioned matrix DiTfj in every step will 
inevitably affect the accuracy attainable for x- As a consequence, in each iteration 
we have to solve the system with the matrix Dc arising in ttj^ with an accuracy 
comparable with the accuracy at which we want to obtain x (although the accuracy 
requirements could, in principle, be somewhat relaxed as the iteration proceeds due 
to results from [151 [iS]). 

The difference of the two approaches is now apparent. In the multigrid context 
we are able to relax on the accuracy of the coarse system, in inexact deflation we 
are not. Since the coarse grid system is still a large system, the work to solve it 
accurately will by far dominate the computational cost in each iteration in inexact 
deflation. In the multigrid context we are allowed to solve at only low accuracy without 
noticeably affecting the quality of the preconditioner, thus substantially reducing the 
computational cost of each iteration. Moreover, such a low accuracy solution can 
particularly efficiently be obtained by a recursive application of the two-grid approach, 
resulting in a true multigrid method. 

For a more detailed analysis of the connection between deffation methods (includ- 
ing inexact deflation) and multigrid approaches we refer to [211 EH US]: e.g. 

4.3. Adaptivity in the Setup of Inexact Deflation. To set up the inex- 
act deflation method we need a way to obtain test vectors to construct the inexact 
deflation operators. Once these vectors are found the method is completely deflned 
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(see section 4.1). In complete analogy to the discussion of adaptive algebraic multi- 



grid methods in sections |3.3| and |3.4[ these test vectors should contain information 
about the eigenvectors corresponding to small eigenvalues of the operator DMg^Jp, 
the preconditioned system. 

Though the setup proposed in 35J is similar in nature to the one described in 
section 3.3 it differs in one important way. Instead of working on the homogeneous 
equation D-ip — with a random initial guess to obtain the test vectors, it starts with 
a set of random test vectors "ipj approximately computes D~^il>j using SAP. The 
(approximate) multiplication with will amplify the components of ip belonging 
to the near kernel. These new vectors are now used to define P (and D^), yielding an 
inexact deflation method which can anew be used to approximately compute D~^ipj 
giving new vectors for P. The whole process is repeated several times; see Algorithm[2] 
for a detailed description where a total of Uin^ of these cycles is performed. 

Algorithm 2 Inexact deflation setup - IDsetup(7ii„„,i^) as used in 35 



1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 
10: 
11: 
12: 
13: 



Let vi, . . . ,vn £ C" be random test vectors 
for ?7 = 1 to 3 do 
for j = 1 to do 

end for 
end for 

for ?7 = 1 to riiny do 

(re-)construct P and Dc from current Wi, . . . , wat 
for j = 1 to A'^ do 



{m''sap^l+PD-^P")v, 



1 1". 1 1 
end for 

end for 



The update Vj •(— {M^'^pTTL+PD^ ^P^)vj in line 10 of the algorithm is equivalent 



to the application of the V-cycle iteration matrix C'^'^^ (cf. (4.2 )). It can be interpreted 
as one step of an iteration to solve Dv = vj with initial guess and iteration matrix 

This update of the te st ve ctors can also be viewed in terms of the bootstrap AMG 
setup outlined in section 



3.4 



While the first part of the update, Mg^JpTTLVj, is the 
application of a coarse grid correction followed by smoothing, i.e., a test to gauge the 
effectiveness of the method (cf. section 3.3 ), the second part of the update, PD^^P^vj 
is in range(P). In contrast to the bootstrap methodology (cf. section 3.4 1 where an 
update in range(P) is obtained by interpolating eigenvectors to small eigenvalues of 
Dc, in the inexact deflation variant these "optimal" vectors are only approximated. 

5. DD-aAMG. We now have all the ingredients available to describe our do- 
main decomposition/aggregation based adaptive algebraic multigrid (DD-aAMG) 
method for the Wilson Dirac operator (1.1). 

As its smoother we take M^^^p: i-^-, we perform v iterations of red-black Schwarz 
as formulated in Algorithm [l] Like i/, the underlying block decomposition of the 
lattice £ is a parameter to the method which we will specify in the experiments. 

The coarse system Dc is obtained as Dc = P^ DP, where P is an aggregation 
based prolongation obtained during the adaptive setup phase. The aggregates are 
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from a standard aggregation according to Definition |3.4[ implying that it is in par- 
ticular lattice-block based and Fs-compatible. Parameters of the aggregation are the 
underlying block decomposition of £ (which does not necessarily match the one under- 
lying the SAP smoother) and the test vectors ui, . . . , ujv upon which P is built. The 



coarse grid matrix Dc inherits all of the important properties of D, cf. Lemma 3.3 

We combine the smoothing iteration and the coarse grid correction into a standard 
1^-cycle with no pre- and v steps of post smoothing so that the iteration matrix of 



one y-cycle is given by C^'^^ from (4.2 1. Instead of using iterations with the y-cycle 



as a stand-alone solver, we run FGMRES, the flexible GMRES method (cf. piP) with 
one y-cycle used as a (right) preconditioner. 

It remains to specify how we perform the adaptive setup yielding the test vectors 
wi, . . . , w^r. Extensive testing showed that a modification of the inexact deflation setup 
(Algorithm [2| is the most efficient. The modification is a change in the update of the 
vectors Vj in the second half. Instead of doing one iteration with C*^"^^ and initial 
guess to approximately solve Dv — Vj, we use the currently available vector Vj as 
our initial guess, see Algorithm [3] 

Algorithm 3 DD-Q:AMG-setup(ni„„, t^) 

perform Algorithm [2] with line [lO] replaced by 



6. Numerical Results. We implemented the DD-aAMG method in the pro- 
gramming language C using the parallelization interface of MPI. Up to now, we only 
have an implementation of a two-grid method at hand for which the numerical ex- 
periments already show truly satisfactory results. As we will see, we spend a large 
amount of time solving the coarse grid system, though. This indicates the potential 
for additional speed up when the method is applied recursively in a true multigrid 
approach. We will come back to this point in the conclusions in Section [7] 

Our code is optimized to a certain extent, but certainly not to the extreme. 
As is customary in lattice QCD computations, we use a mixed precision approach 
where we perform the y-cycle of the preconditioner in single precision. Low level 
optimization (e.g., making use of the SSE- registers on Intel architectures) has not 
been considered, yet. All Krylov subspace methods (FGMRES, GCR, CG) have been 
implemented in a common framework with the same degree of optimization to allow 
for a standardized comparison of computing times. This is particularly relevant when 
we compare timings with CGNR, the CG method applied to the normal equation 

Dt/j = D^b. Let us note already at this stage that in CGNR we used the residual 
with respect to the original equation = b for the stopping criterion. This residual 
can be obtained at the cost of one AXPY operation in the CGNR code. We include a 
comparison with the inexact deflation approach, since an efficient implementation is 
publicly available. A further comparison with the multilevel approaches from ^ 137) 
would have required a very substantial additional implementation effort which is out 
of scope for this paper. 

A commonly used technique in lattice QCD computations is odd-even precondi- 
tioning. A lattice site x is called even if xi + X2 + X3 + X4 is even, else it is called odd. 
Due to the nearest neighbor coupling, the Wilson Dirac operator has the form 

7-j / -^ee ^eo \ 

" \Doe Doo) ' 
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if we order all even sites first. Herein, D^e and Dqo are block diagonal with 12 x 12 di- 
agonal blocks. Instead of solving a system with D we can solve the corresponding sys- 
tem for the odd lattice sites given by the Schur complement Ds = Dqo — DoeD~^Deo 
and then retrieve the solution at the even lattice sites, cf. [37j. The inverse 
is precomputed once for all, and the operator Ds is applied in factorized form. A 
matrix-vector multiplication with Ds thus requires the same work than one with D 
while the condition of -D5 improves over that of D. Typically, this results in a gain of 
2 — 3 in the number of iterations and execution time. Our results for CGNR do not 
use the odd-even reduced system, since an efficient implementation would require a 
storage scheme for the links U^{x) different from the one needed for DD-aAMG. We 
do, however, use odd-even preconditioning when we solve the coarse system involving 
Dc in DD-aAMG, which by default is done using odd-even preconditioned restarted 
GMRES with a restart length of 30. We implemented this odd-even preconditioning 
similarly in spirit to what was proposed for the Wilson Dirac operator in |31j . 



parameter default 

setup munbcr of iterations nj„„ 6 

number of test vectors N 20 

size of lattice-blocks for aggregates 4* 

coarse system relative residual tolerance 5 • 10^^ 
(stopping criterion for the coarse system) 

solver restart length of FGMRES riuv 25 

relative residual tolerance (stopping criterion) tol 10^^" 

smoother number of post smoothing steps'*^ v 5 

size of lattice-blocks in SAP**' 2* 
number of Minimal Residual (MR) iterations to 

solve the local systems ( [2^ in SAP'*' 3 
Table 6.1 



Parameters for the DD-aAMG two-level method. {*) : same in solver and setup 



Table |6.1| summarizes the default parameters used for DD-aAMG in our experi- 
ments. Besides those discussed in Section[5] the table also gives the stopping criterion 
used for the solves with the coarse system Dc (the initial residual is to be decreased by 
a factor of 20) and the stopping criterion for the entire FGMRES iteration (residual 
to be decreased by a factor of 10^"). In each SAP iteration we have to solve the local 
systems ( |2.8[ ). Instead of requiring a certain decrease in the residual we here fix the 
number of iterative steps (to 3). The iterative method we use here is the minimal 
residual method MR, i.e., restarted GMRES with a restart length of 1, where each 
iterative step is particularly cheap. 

For the various configurations and respective matrices we found that this default 
set of parameters yields a well performing solver, with only little room for further 
tuning. The size of the lattice-blocks (4"^ and 2'*) fits well with all lattice sizes occurring 
in practice, where Nt and Ng are multiples of 4. The number of setup iterations, ni„„, 
is the only one of these parameters which should be tuned. It will depend on how 
many systems we have to solve, i.e., how many right hand sides we have to treat. 
When Tiinv is increased, the setup becomes more costly, while, at the same time, the 
solver becomes faster. Thus the time spent in the setup has to be balanced with the 
number of right hand sides, and we will discuss this in some detail in Section [6. 2[ The 
default Uinv = 6 given in Table 6.1 should be regarded as a good compromise. 
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id 


lattice size 


pion mass 


CGNR 


shift 


clover 


provided by 






Nt X Nl 


[MeV] 


iterations 


mo 


term Csw 




1 




48 X 16^ 


250 


7,055 


-0.095300 


1.00000 


BMW-c HSJIH] 






48 X 243 


250 


11,664 


-0.095300 


1.00000 


BMW-c [20] [21] 






48 X 32^ 


250 


15,872 


-0.095300 


1.00000 


BMW-c [20] [21] 






48 X 48''' 


135 


53,932 


-0.099330 


1.00000 


BMW-c [20] [21] 






64 X 64^ 


135 


84,207 


-0.052940 


1.00000 


BMW-c [20] E] 






128 X 64^ 


270 


45,804 


-0.342623 


1.75150 


CLS [Il][l2] 



Table 6.2 

Configurations used together with their parameters. For details about their generation we refer 
to the references. Pion masses rounded to steps of 5 MeV. 



The configurations we used are listed in Table 6.2 In principle the pion mass rriTr 
and the lattice spacing (not listed) determine the condition of the respective matrix, 
e.g., the smaller m,r, the more ill conditioned is the respective matrix. The physical 
pion mass is mT^^^^^ — 135 MeV which is taken on by the configurations |4] and [Sj The 
conditioning of the matrices is indicated by the iteration count of CGNR to decrease 
the residual by a factor of 10^". 

We ran DD-aAMG on the various configurations, analyzed the behavior of the 
setup routine and performed different scaling test. All results have been computed on 
the Juropa machine at Jiilich Supercomputing Centre, a cluster with 2,208 compute 
nodes, each with two Intel Xeon X5570 (Nehalem-EP) quad-core processors. Unless 
stated otherwise the icc-compiler with the optimization fiag -03 was used. 

6.1. Comparison with CGNR. First we compare CGNR with the DD-aAMG 
method using the standard parameter set for a 64^ configuration at physical pion mass 
which represents an ill-conditioned linear system with n — 201,326,592. 





CGNR 


DD-qAMG 


speed up factor 


coarse grid 


setup time 




15.38s 




6.96s 


solve iter 


84,207 


20 




3,997(*) 


solve time 


816.26s 


5.17s 


157.9 


4.11s 


total time 


816.26s 


20.55s 


39.7 


11.07s 



Table 6.3 

CGNR vs. DD-aAMG with default parameters (Table [g. ij ) on an ill-conditioned 64* lattice 
(Tablew^ id\^, 8,192 cores, {*) : total iteration count summed up over all solver iterations. 



The results reported in Table |6.3| show that we obtain a speed-up factor of 40 
over CGNR with respect to the total timing. Excluding the setup time, we gain a 



factor of 158. The right most column of Table 6.3 shows that in this ill-conditioned 



case about 80% of the solve time go into computations on the coarse grid. 

6.2. Setup Evaluation. Lattice QCD computations are dominated by two ma- 
jor tasks: generating configurations within the Hybrid Monte-Carlo (HMC) algorithm 
[30] and evaluating these configurations, i.e., calculating observables. Both tasks re- 
quire solutions of the lattice Dirac equation. 

The HMC generates a sequence of stochastically independent configurations. The 
configuration is changed in every step, and the Wilson Dirac equation has to solved 
only once per configuration. Thus HMC requires a new setup — or at least an update — 
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for the interpolation and coarse grid operator in every step. Therefore the costs of 
setup/update and solve have to be well-balanced. 

The calculation of observables typically requires several solves for a single config- 
uration. Therefore one would be willing to invest more time into the setup in order 
to obtain a better solver. 



number of 


average 


average 


lowest 


highest 


average 


average 


setup 


setup 


iteration 


iteration 


iteration 


solver 


total 


steps Tlijiy 


timing 


count 


count 


count 


timing 


timing 


1 


2.02 


349.6 


343 


356 


19.67 


21.69 


2 


3.19 


122.2 


118 


127 


8.26 


11.45 


3 


4.63 


50.8 


50 


52 


4.38 


9.01 


4 


6.87 


31.3 


31 


32 


3.14 


10.01 


5 


10.24 


25.2 


25 


26 


2.68 


12.92 


6 


14.43 


23.0 


23 


23 


2.52 


16.95 


7 


18.30 


22.0 


22 


22 


2.55 


20.85 


8 


21.97 


21.7 


21 


22 


2.67 


24.64 


9 


24.97 


21.0 


21 


21 


2.73 


26.70 


10 


28.14 


21.3 


21 


22 


2.86 


31.00 


11 


31.05 


21.9 


21 


22 


3.05 


34.10 


12 


33.75 


22.1 


22 


23 


3.11 


36.86 



Table 6.4 

Evaluation of DD-aAMG-setup(ni,iy,5) cf. Algorithm^ 48* lattice, ill-conditioned configura- 
tion (Table\6^ id^, 2 ,592 cores, averaged over 10 runs. 



Table 6.4 illustrates how the ratio between setup and solve can be balanced de- 
pending on the amount of right hand sides. In this particular case 3 steps in the 
setup might be the best choice if only a single solution of the system is needed. For 
several right hand sides 6 steps might be the best choice. Doing up to 9 steps can 
lower the iteration count of the solver even further, but the better the test vectors 
approximate the near kernel, the more ill-conditioned the coarse system becomes, i.e., 
lowering the iteration count of the solver means increasing the iteration count on the 
coarse system. 

The numbers shown have been averaged over 10 runs, because the measurements 
vary due to the choice of random initial test vectors. The fourth and the fifth column 
of Table [63] show that the fiuctuations in the iteration count of the solver are modest. 



For riinv > 3 the fluctuations almost vanish completely. 







DD-aAMG iteration counts 






conf 1 


conf 2 conf 3 conf 4 conf 5 


conf 6 


3 


54 


58 56 55 54 


65 


6 


23 


24 24 24 23 


23 



Table 6.5 

Configuration dependence study of DD-aAMG with DD-aAMG-setupfnjny, 5), i ll-conditioned 
48* lattice for 6 different, ill-conditioned configurations on 48* lattices, (Table id^^, 2,592 

cores. 



Table [675| gives the iteration count of DD-aAMG for a set of 6 stochastically in- 
dependent configurations from a single HMC simulation. If we invest just 3 iterations 
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in the setup we observe a (tolerable) variation of the iteration count, whereas for 6 
iterations in the setup the iteration count becomes very stable. 



6.3. Scaling Tests. We now study the scaling behavior of the solver as a func- 
tion of the mass parameter and the system size. While the first determines the con- 
dition number of the Wilson Dirac operator, the second has an effect on the density 
of the eigenvalues. In particular, increasing the volume leads to a higher density of 
small eigenvalues [3]. In a weak scaling test we also analyze the performance as a 
function of the number of processors used. 

6.3.1. Mass Scaling. For this study we used a 48^ lattice configuration. We 
ran the setup once for the mass parameter mo = —0.09933 in the Wilson Dirac 
operator (2.4 1. This represents the most ill-conditioned system where the pion mass 
with 135 MeV is physical. We then used this interpolation operator for a variety of 
other mass parameters, where we then ran the DD-aAMG solver without any further 
setup. 





CGNR 


DD-a 


AMG 


coarse system 


mo 


iteration 


solver 


iteration 


solver 


;zfitcration 


timing 




count 


timing 


count 


timing 


count 


(% solve time) 


-0.03933 


1,597 


14.1 


15 


0.83 


10 


0.13 (15.5) 


-0.04933 


1,937 


17.2 


16 


0.89 


11 


0.15 (16.6) 


-0.05933 


2,454 


21.8 


17 


0.95 


13 


0.18 (18.9) 


-0.06933 


3,320 


29.4 


18 


1.04 


16 


0.22 (21.1) 


-0.07933 


5,102 


45.3 


18 


1.13 


20 


0.29 (25.1) 


-0.08933 


10,294 


91.5 


20 


1.44 


31 


0.50 (34.8) 


-0.09033 


11,305 


100.3 


20 


1.47 


33 


0.53 (35.8) 


-0.09133 


12,527 


111.2 


20 


1.43 


36 


0.53 (37.1) 


-0.09233 


14,009 


124.4 


20 


1.48 


38 


0.57 (38.4) 


-0.09333 


15,869 


141.3 


21 


1.68 


41 


0.67 (39.7) 


-0.09433 


18,608 


165.5 


21 


1.68 


45 


0.71 (42.2) 


-0.09533 


22,580 


201.2 


21 


1.70 


49 


0.75 (43.9) 


-0.09633 


27,434 


244.4 


21 


1.79 


54 


0.82 (45.7) 


-0.09733 


33,276 


296.5 


22 


2.15 


63 


1.08 (50.4) 


-0.09833 


42,067 


373.7 


22 


2.30 


74 


1.24 (53.9) 


-0.09933 


53,932 


480.4 


23 


2.60 


86 


1.49 (57.4) 


Table 6.6 



Mass scaling behavior of DD-aAMG, 48* lattice (Table \6^ id\^, 2,592 cores. 



In Table 6.6 we compare CGNR and DD-aAMG with respect to the timing for 
one right hand side and the scaling with the mass parameter mg. For the smallest 
mo, DD-aAMG is 185 times faster than CGNR and even for the largest value of mo 
there remains a factor of 17. We also see that the two methods scale in a completely 
different manner. The CGNR solve for the smallest mo is 34 times more expensive 
than the solve for the largest one. On the other hand the DD-aAMG timings just 
increases by a factor of 3.1, the iteration count even only by a factor of 1.5. The 
coarse grid iteration count, however, increases by a factor of 8.6. 

6.3.2. Scaling of the Error. In all results shown so far the stopping criterion 
was based on the norm of the residual r, since this is a quantity which is available 
computationally. Ultimately we are interested in the norm of the error e. Since 
we have r = De the error components belonging to small eigenvalues of D have 
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strong influence on the solution and weak influence on the residual. Obviously, ||e|| < 
IID^^II • ||r||, and in the worst case we have equality. As the norm of tends to be 
quite large for our ill-conditioned systems, it is worthwhile to compare CGNR with 
our two-grid method. 




Fig. 6.1. Behavior of the error as a function ofmo- 



We studied systems with different mass parameters mo with predetermined so- 
lutions and compared the error norms obtained by both methods with the stopping 
criterion to reduce the norm of the initial residual by 10^*^. Figure 6.1 shows that 
the error of DD-aAMG tends to be one order of magnitude better than the error of 
CGNR. Both errors show almost the same scaling behavior as a function of TOo,with 
the error becoming larger as the system gets more ill-conditioned. 

6.3.3. System Size Scaling. In Table [677| we report tests on the scaling with 
the system size for constant mass parameter and (physical) lattice spacing. We again 
compare DD-aAMG with CGNR. The iteration count of CGNR for Nt x lattices 
appears to scale with Ng and thus approximately doubles from Ng — 16 to Ng = 32. 
For DD-aAMG we observe a scaling of the iteration count with log{Ng). The solver 
timing scales somewhat less well due to the increased time spent on the coarse grid. 
Overall, the time to solution for CGNR increases by a factor of 2.36, whereas for 
DD-aAMG it is only a factor of 1.51. 



lattice size 
Nt X 


CGNR 




DD-aAMG 




iteration 
count 


solver 
timing 


setup 
timing 


iteration 
count 


solver 
timing 


48 X 16''' 


7,055 


55.9s 


4.14 


22 


1.32 


48 X 24^ 


11,664 


96.2s 


4.22 


26 


1.65 


48 X 323 


15,872 


131.9s 


4.33 


30 


1.99 



Table 6.7 

Lattice size scaling of DD-aAMG, nj jiy — 3 setup iterations, lattices generated with the same 
mass parameter and lattice spacing (Table\6J^ id\l\\M and\^ , local lattice size 4 X 8"^. 



6.3.4. Weak Scaling. For a weak scaling test we ran 100 iterations of DD- 
aAMG with riinv = 3 in the setup on lattices ranging from size 16^ • 8^ on a single 
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node (8 cores/node) to 64"* on 1,024 nodes with 4 • 8^ local lattice size on each core, 
cf. Figure [Ol 



20 



DD-aAMG-setup(3,5 
DD-aAMG(100,5 
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Fig. 6.2. Weak scaling test of DD-aAMG. The lattice size is increased with the number of 
processes, keeping the local lattice size per process fixed to 4 ■ 8'^ . 



For the scaling study we fixed the number of iterations on the coarse grid to 
be exactly 50 steps of odd-even preconditioned GMRES so that we always have the 
same number of 100 MPI_Allreduce operations. We therefore see the usual log(p) 
dependence, p the number of processes, caused by global communication. Apart from 
this, our method scales well up to 8,192 processes. 

6.4. Comparison with the Inexact Deflation Method. Since the inexact 
deflation code of ^SS, is publicly available [33,, we can compare its performance with 
DD-aAMG. The standard size for the lattice-blocks in the SAP iteration in the inexact 
deflation code is 4^, which is what we used in our experiments. The local systems 
(2.8) on the lattice-blocks are solved in [33] via the odd-even preconditioned minimal 



residual (MR) method with the iteration number fixed to 4. This ensures that the local 
systems are solved accurately enough with relatively low computational cost and turns 
out to be optimal or nearly optimal for the overall execution time. These parameters 
for inexact deflation arc thus different from the ones wc use in DD-aAMG, where the 
lattice-block size is 2^ and 3 steps of the non-preconditioned minimal residual method 
are performed. 

6.4.1. Comparison Without Low-level Optimization. The following re- 
sults were produced on the same 48^ lattice as in Sections |6.2| and |6.3.l| and on a 
128 X 64^ lattice (Table 



6.2 



id 6|. Except for the coarse grid tolerance and the SAP 



block size we have chosen a common set of parameters (cf. Table 6.1) and the Intel 
ice compiler with optimization flag -03. 

Table [678| compares inexact deflation and DD-aAMG for a whole range for rti„„, 
the number of iterations in the setup. We see that the DD-aAMG provides the fastest 
solver for riinv — 6 requiring 2.50s. The inexact deflation setup provides the fastest 
solver for rtinv = 10, but this solver needs a factor of 2.2 more time than the DD- 
aAMG solver and the setup by itself is also (slightly) more expensive. If we look at 
the time for setup and solve for one right hand side, riinv = 3 is best for the DD- 
aAMG method, where we obtain a total timing of 9.01s. Here, the best choice for 
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Inexact deflation 






DD-aAMG 




setup 


setup 


iteration 


solver 


setup 


iteration 


solver 


steps Tlin-i^ 


timing 


count (coarse) 


tuning 


timing 


count (coarse) 


timing 


1 


2.03s 


233 (82) 


18.3s 


1.99s 


350 (12) 


19.7s 


2 


3.19s 


155 (145) 


14.7s 


3.17s 


120 (29) 


8.05s 


3 


4.45s 


108 (224) 


12.1s 


4.58s 


52 (54) 


4.43s 


4 


5.88s 


84 (301) 


10.5s 


6.95s 


32 (74) 


3.28s 


5 


7.71s 


70 (320) 


8.86s 


10.3s 


25 (81) 


2.60s 


6 


9.22s 


63 (282) 


7.30s 


14.2s 


23 (86) 


2.50s 


7 


10.6s 


58 (277) 


6.53s 


18.3s 


22 (100) 


2.62s 


8 


12.3s 


54 (267) 


6.07s 


22.7s 


22 (102) 


2.61s 


9 


14.1s 


51 (263) 


5.53s 


24.9s 


21 (116) 


2.70s 


10 


15.9s 


49 (265) 


5.44s 


27.6s 


21 (127) 


2.83s 


11 


17.5s 


50 (266) 


5.48s 


30.6s 


22 (129) 


3.06s 


12 


19.5s 


53 (254) 


5.72s 


33.8s 


23 (131) 


3.27s 



Table 6.8 



Comparison with fair compiler settings of DD-aAMG with inexact deflation, both with 5 SAP 
iterations in each setup step, both methods tuned equally, except SAP block size 4* with MR ( 4 ) and 
coarse system solver tolerance lO"-'^ in ID, ill-conditioned system on a 48* lattice (Table \6^ id^^, 
2,592 cores. 



inexact deflation is riinv = 6 with a total timing of 16.2s. 

We also see that except for very small values for ni„„, the number of iterations 
required in DD-aAMG is less than half of that in inexact deflation. The numbers in 
parenthesis denote the average number of coarse solver iterations in each iteration of 
the respective method. For DD-aAMG the number of iterations on the coarse grid 
increases with the work spent in the setup. Hence, the lowest DD-aAMG-iteration 
count does not necessarily provide the fastest solver in the two grid setting. This might 
change, however, in a true multigrid approach. In inexact deflation the number of 
iterations on the coarse grid is not that clearly tied to ni„„ . Since in inexact deflation 
the coarse system has to be solved very accurately, the number of iterations needed 
to solve the coarse grid system is higher than in DD-aAMG. It is only moderately 
(a factor of 2 to 4) higher, though, because the code from [53] uses an additional 
adaptively computed preconditioner for the GCR iterations on the coarse system, 
whereas we just use the less efRcient odd-even preconditioning in DD-aAMG. 

For the same number of test vectors, DD-aAMG produces a coarse system twice 
as large as that of inexact deflation with the benefit of preserving the Fs structure on 
the coarse grid. The DD-aAMG coarse grid system seems to be more ill-conditioned — 
an indication that the important aspects of the fine grid system are represented on the 
coarse grid — and the resulting coarse grid correction clearly lowers the total iteration 
count more efficiently and thus speeds up the whole method. 

We also compared both methods in Table 6.9 for another configuration typical for 
many recent lattice QCD computations. The distribution function for the links U^{x) 
in configuration [6] is quite different from that in configuration [4] from Table [6^ Again 
we took the default parameter set, but now with a relatively cheap setup {riinv = 4). 
This results in a speed up factor of more than 1.5 for setup and solve in DD-aAMG 
against inexact deflation. Still 56% of the execution time is spent in solves of the 
coarse system in DD-aAMG. 
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Inexact deflation 


DD-aAMG 


speed up factor 


setup iter 


6 


4 




setup time 


22.3s 


14.8s 


1.51 


solve iter 


37 


40 




solve time 


16.6s 


10.1s 


1.64 


total time 


38.9s 


24.9s 


1.56 



Table 6.9 

Comparison of DD-aAMG with inexact deflation on an ill-conditioned system on a 128 X 64^ 
lattice (Table\6^ id\^, same parameters as in Table [fiTg} 8,192 cores. 







Inexact deflation 






DD-aAMG 




setup 


setup 


iteration 


solver 


setup 


iteration 


solver 


steps (n,j„,„) 


timing 


count (coarse) 


timing 


timing 


count (coarse) 


timing 


1 


1.01s 


233 (82) 


10.1s 


1.78s 


350 (12) 


19.1s 


2 


1.87s 


155 (145) 


10.2s 


2.76s 


122 (29) 


7.66s 


3 


2.69s 


108 (224) 


9.96s 


4.33s 


51 (55) 


4.45s 


4 


3.43s 


84 (301) 


9.25s 


6.47s 


31 (73) 


2.69s 


5 


6.14s 


70 (320) 


7.50s 


9.02s 


25 (80) 


2.54s 


6 


5.68s 


63 (282) 


5.21s 


13.5s 


23 (86) 


2.49s 


7 


6.93s 


58 (277) 


4.67s 


16.6s 


22 (96) 


2.23s 


8 


7.71s 


54 (267) 


4.12s 


20.5s 


22 (108) 


2.35s 


9 


8.74s 


51 (263) 


3.89s 


21.7s 


21 (118) 


2.62s 


10 


10.1s 


49 (265) 


3.62s 


25.2s 


21 (126) 


2.77s 


11 


11.3s 


50 (266) 


3.77s 


28.7s 


22 (129) 


3.08s 


12 


12.6s 


53 (254) 


4.13s 


32.5s 


22 (132) 


2.69s 



Table 6.10 

Comparison with best compiler settings for each method using the same set of parameters as in 
Tablele^ 



6.4.2. Comparison With Low- level Optimization. The inexact deflation 
code provides assembly coded parts which allow to optimally use the SSE registers 
of Intel architectures. For the sake of completeness we have repeated the runs from 



Table 6.8 for both methods with the respectively best compiler options available. This 
means that we used the gcc compiler with the -03 flag and customized SSE optimiza- 
tion for the inexact deflation method and the ice compiler with the optimization 
flags -03 -ipo -axSSE4.2 -m64 for for the DD-aAMG method. Since our focus is 
on algorithmic improvements we did not work on customized SSE optimization for 
DD-aAMG, although this should, in principle, give additional speed-up. Table [6T0| 
shows that a DD-aAMG setup with riinv ~ 7 provides the fastest DD-aAMG solver 
which is still 1.6 times faster than the fastest inexact deflation solver for riinv = 10. 
The sum of DD-aAMG setup and solver timing is minimized for riinv = 3 and this is 
still 1.2 times less than the minimum of the inexact deflation setup and solver timing 

for Hinv = 6. 

7. Conclusions and Outlook. The developed method, combining domain de- 
composition techniques and algebraic multigrid, shows great potential to speed up 
calculations in lattice QCD. For small quark masses and large lattice sizes our method 
outperforms conventional Krylov subspace methods and also the fastest publicly avail- 
able solver for many right hand sides as well as for a single right hand side. This 
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result is mainly due to the introduction of the highly parallel domain decomposition 
smoother and the efficient setup procedure into the algebraic multigrid method. Ad- 
ditional improvements can be expected by a recursive application in a true multigrid 
method. For example, if in the situation of Table 6.3 true multigrid reduces the part 
of the work spent on the coarse grid from 80% to 20% — so that it is well balanced with 
the 20% spent on the fine grid — we obtain another speed-up factor of 2.5. In addition, 
true multigrid should also lead to significant improvements when moving towards ever 
larger lattice sizes and we expect it to improve the scaling with respect to the mass 
parameter and the system size. Finally, in accordance with observations for multigrid 
methods for elliptic PDEs made in I47j . we can speculate on further improvements 
over the observations in Figure [6?l] i.e., that in a true multigrid method the norm of 
the error is even closer to that of the residual than in the two-grid method. 

Another factor of 1.5 - 2 could probably be obtained by machine specific optimiza- 
tion (SSE/AVX/QPX). We are currently working on the implementation of additional 
levels and incorporating our algorithm into the production codes of our collaborators 
within SFB/TRR55. Furthermore we are planning to investigate how to incorpo- 
rate our DD-qAMG method into the Hybrid Monte Carlo Method, i.e., updating the 
multigrid hierarchy in a cost efficient way. 
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